Summary

Automated detection of acoustic signals is crucial for effective monitoring of vocal animals and their habitats across large spatial and temporal scales. Recent advances in deep learning have made high-performing automated detection approaches accessible to more practitioners. However, few deep learning approaches can be implemented natively in R. The ‘torch for R’ ecosystem has made transfer learning with convolutional neural networks accessible for R users. Here we provide an R package and workflow that uses transfer learning for the automated detection of acoustic signals from passive acoustic monitoring (PAM) data collected in Sabah, Malaysia. The package provides functions to create spectrogram images from PAM data, compare the performance of different pre-trained CNN architectures, and deploy trained models over directories of sound files. The R programming language remains one of the most commonly used languages among ecologists, and we hope that this package makes deep learning approaches more accessible to this audience.

Statement of need

Passive acoustic monitoring

We are in a biodiversity crisis, and there is a great need to rapidly assess biodiversity in order to understand and mitigate anthropogenic impacts. One approach that can be especially effective for monitoring vocal yet cryptic animals is passive acoustic monitoring (PAM; Gibb et al. 2018), a technique that relies on autonomous acoustic recording units. PAM allows researchers to monitor vocal animals and their habitats at temporal and spatial scales that are impossible to achieve using human observers alone. Interest in the use of PAM in terrestrial environments has increased substantially in recent years (Sugai et al. 2019), due to the reduced price of recording units and improvements in battery life and data storage capabilities. However, the use of PAM often leads to the collection of terabytes of data that are time- and cost-prohibitive to analyze manually.

Automated detection

Some commonly used non-deep learning approaches for the automated detection of acoustic signals in terrestrial PAM data include binary point matching (Katz, Hafner, and Donovan 2016), spectrogram cross-correlation (Balantic and Donovan 2020), and the use of a band-limited energy detector with a subsequent classifier, such as a support vector machine (Clink et al. 2023; Kalan et al. 2015). Recent advances in deep learning have revolutionized image and speech recognition (LeCun, Bengio, and Hinton 2015), with important cross-over for the analysis of PAM data. Traditional approaches to machine learning relied heavily on feature engineering, as early machine learning algorithms required a reduced set of representative features, such as features estimated from the spectrogram; deep learning does not require such feature engineering (Stevens, Antiga, and Viehmann 2020). Convolutional neural networks (CNNs), among the most effective deep learning algorithms, are useful for processing data that have a ‘grid-like topology’, such as image data, which can be considered a 2-dimensional grid of pixels (Goodfellow, Bengio, and Courville 2016). The ‘convolutional’ layers learn feature representations of the inputs; these layers consist of sets of filters, which are small two-dimensional matrices of weights, and the number of filters is the primary parameter (Gu et al. 2018). However, if training data are scarce, overfitting may occur, as representations of images tend to be large with many variables (LeCun, Bengio, and others 1995).

Transfer learning?

Transfer learning is an approach wherein the architecture of a pretrained CNN (generally trained on a very large dataset) is applied to a new classification problem. For example, CNNs trained on the ImageNet dataset of > 1 million images (Deng et al. 2009), such as ResNet, have been applied to the automated detection and classification of primate and bird species from PAM data (Dufourq et al. 2022; Ruan et al. 2022). At the most basic level, transfer learning in computer vision applications retains the feature extraction or embedding layers and modifies the last few classification layers, which are trained for the new classification task (Dufourq et al. 2022).
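In the ‘torch for R’ ecosystem, this idea can be sketched in a few lines: load a pretrained ResNet, freeze its feature-extraction weights, and replace the final fully connected layer with a new, trainable one sized for the target classes. This is a minimal illustration following the tutorial pattern in Keydana (2023), not the internals of ‘gibbonNetR’; the number of classes here is arbitrary.

```r
library(torch)
library(torchvision)

# Load a ResNet-18 pretrained on ImageNet
model <- model_resnet18(pretrained = TRUE)

# Freeze the feature-extraction (embedding) layers
for (par in model$parameters) {
  par$requires_grad_(FALSE)
}

# Replace the final classification layer with a new, trainable layer
# sized for the new task (e.g. 5 classes: four vertebrates plus noise)
num_classes <- 5
model$fc <- nn_linear(in_features = model$fc$in_features,
                      out_features = num_classes)
```

During subsequent training, only the weights of the new layer (and any layers deliberately unfrozen) are updated on the new dataset.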

‘torch for R’ ecosystem

‘Keras’ (Chollet and others 2015), ‘PyTorch’ (Paszke et al. 2019), and ‘TensorFlow’ (Martín Abadi et al. 2015) are some of the more popular neural network libraries; these libraries were all initially developed for the Python programming language. Until recently, deep learning implementations in R relied on the ‘reticulate’ package, which serves as an interface to Python (Ushey, Allaire, and Tang 2022). However, the recent release of the ‘torch for R’ ecosystem provides a framework based on ‘PyTorch’ that runs natively in R and has no dependency on Python (Falbel 2023). Running natively in R means more straightforward installation and higher accessibility for users of the R programming environment. Keydana (2023) provides tutorials for transfer learning in the ‘torch for R’ ecosystem, and the functions in ‘gibbonNetR’ rely heavily on these tutorials.

Overview

This package provides functions to create spectrogram images, use transfer learning from six pretrained CNN architectures (AlexNet (Krizhevsky, Sutskever, and Hinton 2017), VGG16, VGG19 (Simonyan and Zisserman 2014), ResNet18, ResNet50, and ResNet152 (He et al. 2016)), evaluate model performance, deploy the highest-performing model over a directory of sound files, and extract embeddings from trained models to visualize acoustic data. We provide an example dataset that consists of labelled vocalizations of the loud calls of four vertebrates from Danum Valley Conservation Area, Sabah, Malaysia.

Usage

First, we create the spectrogram images.

Spectrograms.
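A hedged sketch of this step is shown below. ‘gibbonNetR’ provides a function to convert labelled sound clips into spectrogram images, but the function name and argument names used here (`spectrogram_images()`, `trainingBasePath`, `outputBasePath`, `splits`) are assumptions for illustration and may differ from the package's current API.

```r
# Hypothetical sketch: convert labelled .wav clips into spectrogram
# images and split them into training/validation/test folders
gibbonNetR::spectrogram_images(
  trainingBasePath = 'data/clips/',    # labelled sound clips (assumed path)
  outputBasePath   = 'data/examples/', # where the images are written
  splits           = c(0.7, 0.3, 0)    # train/validation/test proportions
)
```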

Then we train the models using ‘gibbonNetR’ and evaluate them on a test set.

# Location of spectrogram images for training
input.data.path <-  'data/examples/'

# Location of spectrogram images for testing
test.data.path <- 'data/examples/test/'

# User specified training data label for metadata
trainingfolder.short <- 'danummulticlassexample'

# We can specify the number of epochs to train here
epoch.iterations <- c(20)

# Function to train a multi-class CNN
gibbonNetR::train_CNN_multi(input.data.path = input.data.path,
                            architecture = 'resnet50',
                            learning_rate = 0.001,
                            class_weights = c(0.3, 0.3, 0.2, 0.2, 0),
                            test.data = test.data.path,
                            unfreeze.param = TRUE,
                            epoch.iterations = epoch.iterations,
                            save.model = TRUE,
                            early.stop = "yes",
                            output.base.path = "model_output/",
                            trainingfolder = trainingfolder.short,
                            noise.category = "noise")

Evaluating model performance

Here we evaluate performance for the ‘female.gibbon’ class.

# Evaluate model performance
performancetables.dir <- "model_output/_danummulticlassexample_multi_unfrozen_TRUE_/performance_tables_multi"

PerformanceOutput <- gibbonNetR::get_best_performance(performancetables.dir = performancetables.dir,
                                                      class = 'female.gibbon',
                                                      model.type = "multi",
                                                      Thresh.val = 0)

Examine the results

PerformanceOutput$f1_plot
PerformanceOutput$best_f1$F1
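The best model can then be deployed over a directory of sound files, as described in the Overview. The sketch below assumes a deployment function along the lines of `deploy_CNN_multi()`; the function name, argument names, and paths are illustrative assumptions and may differ from the package's current API.

```r
# Hypothetical sketch: run the best-performing trained model over raw
# PAM recordings and keep detections above a probability threshold
gibbonNetR::deploy_CNN_multi(
  model_path     = 'model_output/best_model.pt', # trained model (assumed path)
  sound_file_dir = 'data/soundfiles/',           # directory of raw sound files
  target_class   = 'female.gibbon',              # class of interest
  threshold      = 0.85,                         # detection probability cutoff
  output_dir     = 'detections/'                 # where detections are saved
)
```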

Use the trained model to extract embeddings, and apply unsupervised clustering to identify signals

The use of embeddings has been shown to be an effective way to represent acoustic signals (Lakdari et al. 2024; Sethi et al. 2020).

Extract embeddings


ModelPath <- "model_output/_danummulticlassexample_multi_unfrozen_TRUE_/_danummulticlassexample_20_resnet50_model.pt"

result <- extract_embeddings(test_input = "data/examples/test/",
                             model_path = ModelPath,
                             target_class = "female.gibbon")

We can plot the unsupervised clustering results

result$EmbeddingsCombined

We can output the normalized mutual information (NMI) score and the confusion matrix that results when we use ‘hdbscan’ to match the target class to the cluster with the largest number of observations.

result$NMI
result$ConfusionMatrix
Unsupervised clustering.

References

Balantic, Cathleen, and Therese Donovan. 2020. “AMMonitor: Remote Monitoring of Biodiversity in an Adaptive Framework with R.” Methods in Ecology and Evolution 11 (7): 869–877.
Chollet, François, and others. 2015. “Keras.” https://keras.io.
Clink, Dena J., Isabel Kier, Abdul Hamid Ahmad, and Holger Klinck. 2023. “A Workflow for the Automated Detection and Classification of Female Gibbon Calls from Long-Term Acoustic Recordings.” Frontiers in Ecology and Evolution 11. https://www.frontiersin.org/articles/10.3389/fevo.2023.1071640.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “Imagenet: A Large-Scale Hierarchical Image Database.” In, 248–255. IEEE.
Dufourq, Emmanuel, Carly Batist, Ruben Foquet, and Ian Durbach. 2022. “Passive Acoustic Monitoring of Animal Populations with Transfer Learning.” Ecological Informatics 70: 101688. https://doi.org/10.1016/j.ecoinf.2022.101688.
Falbel, Daniel. 2023. Luz: Higher Level ’API’ for ’Torch’. https://CRAN.R-project.org/package=luz.
Gibb, Rory, Ella Browning, Paul Glover-Kapfer, and Kate E. Jones. 2018. “Emerging Opportunities and Challenges for Passive Acoustics in Ecological Assessment and Monitoring.” Methods in Ecology and Evolution, October. https://doi.org/10.1111/2041-210X.13101.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Gu, Jiuxiang, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, et al. 2018. “Recent Advances in Convolutional Neural Networks.” Pattern Recognition 77: 354–377.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Deep Residual Learning for Image Recognition.” In, 770–778.
Kalan, Ammie K., Roger Mundry, Oliver J J Wagner, Stefanie Heinicke, Christophe Boesch, and Hjalmar S. Kühl. 2015. “Towards the Automated Detection and Occupancy Estimation of Primates Using Passive Acoustic Monitoring.” Ecological Indicators 54 (July 2015): 217–226. https://doi.org/10.1016/j.ecolind.2015.02.023.
Katz, Jonathan, Sasha D Hafner, and Therese Donovan. 2016. “Assessment of Error Rates in Acoustic Monitoring with the R Package monitoR.” Bioacoustics 25 (2): 177–196.
Keydana, Sigrid. 2023. Deep Learning and Scientific Computing with R Torch. CRC Press.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2017. “Imagenet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6): 84–90.
Lakdari, Mohamed Walid, Abdul Hamid Ahmad, Sarab Sethi, Gabriel A Bohn, and Dena J Clink. 2024. “Mel-Frequency Cepstral Coefficients Outperform Embeddings from Pre-Trained Convolutional Neural Networks Under Noisy Conditions for Discrimination Tasks of Individual Gibbons.” Ecological Informatics 80: 102457.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44. https://doi.org/10.1038/nature14539.
LeCun, Yann, Yoshua Bengio, and others. 1995. “Convolutional Networks for Images, Speech, and Time Series.” The Handbook of Brain Theory and Neural Networks 3361 (10): 1995.
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” https://www.tensorflow.org/.
Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, et al. 2019. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” In, 8024–8035. Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Ruan, Wenda, Keyi Wu, Qingchun Chen, and Chengyun Zhang. 2022. “ResNet-Based Bio-Acoustics Presence Detection Technology of Hainan Gibbon Calls.” Applied Acoustics 198: 108939. https://doi.org/10.1016/j.apacoust.2022.108939.
Sethi, Sarab S, Nick S Jones, Ben D Fulcher, Lorenzo Picinali, Dena Jane Clink, Holger Klinck, C David L Orme, Peter H Wrege, and Robert M Ewers. 2020. “Characterizing Soundscapes Across Diverse Ecosystems Using a Universal Acoustic Feature Set.” Proceedings of the National Academy of Sciences 117 (29): 17049–55.
Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv Preprint arXiv:1409.1556.
Stevens, Eli, Luca Antiga, and Thomas Viehmann. 2020. Deep Learning with PyTorch. Simon & Schuster.
Sugai, Larissa Sayuri Moreira, Thiago Sanna Freire Silva, José Wagner Ribeiro, and Diego Llusia. 2019. “Terrestrial Passive Acoustic Monitoring: Review and Perspectives.” BioScience 69 (1): 15–25. https://doi.org/10.1093/biosci/biy147.
Ushey, Kevin, J. J. Allaire, and Yuan Tang. 2022. Reticulate: Interface to ’Python’.